Jan Piotrowski Homework 3

I chose the Lime package (GitHub Repository) and followed the tutorial for continuous tabular data from the official library documentation (Tutorial - Continuous and Categorical Features).

From the first assignment, I obtained the imbalanced SpeedDating dataset with the following features:

  • pref_o_attractive: How important the partner rates attractiveness
  • pref_o_sincere: How important the partner rates sincerity
  • pref_o_intelligence: How important the partner rates intelligence
  • pref_o_funny: How important the partner rates being funny
  • pref_o_ambitious: How important the partner rates ambition
  • pref_o_shared_interests: How important the partner rates having shared interests
  • attractive_o: The partner's rating of me on the night of the event - attractiveness
  • intelligence_o: The partner's rating of me on the night of the event - intelligence
  • funny_o: The partner's rating of me on the night of the event - being funny
  • shared_interests_o: The partner's rating of me on the night of the event - shared interests
  • attractive_important: What you look for in a partner - attractiveness
  • sincere_important: What you look for in a partner - sincerity
  • intellicence_important: What you look for in a partner - intelligence (column name misspelled in the dataset)
  • funny_important: What you look for in a partner - being funny
  • ambtition_important: What you look for in a partner - ambition (column name misspelled in the dataset)
  • shared_interests_important: What you look for in a partner - shared interests
  • interests_correlate: Correlation between the participant's and the partner's ratings of interests
  • like: Did you like your partner?
  • TARGET: Match or not.

This is interesting because like should intuitively be the most important feature, but the model may surface other insights.

I reused the models from the first homework: logistic regression and XGBoost.

The Lime explanations for positive and negative instances from the test set are below.

Logistic Regression

In [33]:
# Show positive examples and their explanations for logistic regression 

for positive_example in positive_examples:
    exp = explainer.explain_instance(positive_example, clf_logistic_regression.predict_proba, num_features=10)
    exp.show_in_notebook(show_table=True, show_all=False)
In [35]:
# show negative examples and their explanations for logistic regression 

for negative_example in negative_examples:
    exp = explainer.explain_instance(negative_example, clf_logistic_regression.predict_proba, num_features=10)
    exp.show_in_notebook(show_table=True, show_all=False)

The explanations do not look stable: the most important features differ from example to example.

XGBoost

In [36]:
# Show positive examples and their explanations for xgboost 

for positive_example in positive_examples:
    exp = explainer.explain_instance(positive_example, clf_xgboost.predict_proba, num_features=10)
    exp.show_in_notebook(show_table=True, show_all=False)
In [37]:
# show negative examples and their explanations for xgboost 

for negative_example in negative_examples:
    exp = explainer.explain_instance(negative_example, clf_xgboost.predict_proba, num_features=10)
    exp.show_in_notebook(show_table=True, show_all=False)

The XGBoost explanations look more consistent than those for logistic regression. The like feature appears among the most important in every XGBoost explanation, which is not the case for logistic regression, and shared_interests_o is the second feature that consistently ranks near the top for XGBoost. Encouragingly, both like and shared_interests_o are features that intuitively should matter.

Appendix

In [1]:
import os

while 'Homeworks' in os.getcwd():
    os.chdir('..')
In [18]:
# pip install pandas matplotlib scikit-learn xgboost plotly lime
import requests
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
import lime
import random

from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
In [7]:
url = 'https://raw.githubusercontent.com/adrianstando/imbalanced-benchmarking-set/main/datasets/SpeedDating.csv'
response = requests.get(url)
if response.status_code == 200:
    with open('SpeedDating.csv', 'wb') as f:
        f.write(response.content)
else:
    print(f"Failed to download the file. Status code: {response.status_code}")
    
path = "SpeedDating.csv"
# first column is the index
df = pd.read_csv(path, index_col=0)
df
Out[7]:
pref_o_attractive pref_o_sincere pref_o_intelligence pref_o_funny pref_o_ambitious pref_o_shared_interests attractive_o intelligence_o funny_o shared_interests_o attractive_important sincere_important intellicence_important funny_important ambtition_important shared_interests_important interests_correlate like TARGET
0 35.0 20.0 20.0 20.0 0.0 5.0 6.0 8.0 8.0 6.0 15.0 20.0 20.0 15.0 15.0 15.0 0.14 7.0 0
1 60.0 0.0 0.0 40.0 0.0 0.0 7.0 10.0 7.0 5.0 15.0 20.0 20.0 15.0 15.0 15.0 0.54 7.0 0
3 30.0 5.0 15.0 40.0 5.0 5.0 7.0 9.0 8.0 8.0 15.0 20.0 20.0 15.0 15.0 15.0 0.61 7.0 1
4 30.0 10.0 20.0 10.0 10.0 20.0 8.0 9.0 6.0 7.0 15.0 20.0 20.0 15.0 15.0 15.0 0.21 6.0 1
5 50.0 0.0 30.0 10.0 0.0 10.0 7.0 8.0 8.0 7.0 15.0 20.0 20.0 15.0 15.0 15.0 0.25 6.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1836 15.0 15.0 20.0 25.0 10.0 15.0 2.0 6.0 2.0 1.0 25.0 10.0 20.0 20.0 10.0 15.0 0.35 5.0 0
1837 15.0 15.0 25.0 25.0 15.0 5.0 4.0 8.0 3.0 2.0 25.0 10.0 20.0 20.0 10.0 15.0 0.45 5.0 0
1838 20.0 20.0 20.0 20.0 10.0 10.0 5.0 4.0 5.0 3.0 25.0 10.0 20.0 20.0 10.0 15.0 0.13 5.0 0
1840 15.0 15.0 25.0 25.0 20.0 0.0 4.0 7.0 3.0 0.0 25.0 10.0 20.0 20.0 10.0 15.0 0.54 5.0 0
1843 10.0 10.0 35.0 35.0 8.0 2.0 7.0 6.0 5.0 2.0 25.0 10.0 20.0 20.0 10.0 15.0 0.54 5.0 0

1048 rows × 19 columns

In [8]:
# train test split
train, test = train_test_split(df, test_size=0.8)
X_train, y_train = train.drop(columns=['TARGET']), train['TARGET']
X_test, y_test = test.drop(columns=['TARGET']), test['TARGET']
len(X_train), len(X_test)
Out[8]:
(209, 839)
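Since SpeedDating is imbalanced, a plain random split can leave the small training set with a skewed positive rate. A minimal sketch, on hypothetical toy data, of checking class balance and using the stratify= argument so train and test keep the same positive rate:

```python
# Sketch: check class balance, then split with stratify= so both
# partitions keep the same positive rate. Toy data, not SpeedDating.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature": range(100),
                   "TARGET": [1 if i % 5 == 0 else 0 for i in range(100)]})
print(df["TARGET"].value_counts(normalize=True))  # 20% positives

train, test = train_test_split(df, test_size=0.8, random_state=42,
                               stratify=df["TARGET"])
print(train["TARGET"].mean(), test["TARGET"].mean())  # both ~0.2
```

With only 209 training rows here, stratification would make the LIME background distribution and the fitted models less sensitive to which rows happened to land in the training split.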
In [28]:
# Train and evaluate logistic regression from sklearn
from sklearn.linear_model import LogisticRegression
clf_logistic_regression = LogisticRegression()
clf_logistic_regression.fit(X_train.values, y_train.values)

y_pred_logistic_regression = clf_logistic_regression.predict_proba(X_test)[:, 1]
/home/janek/.local/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/janek/.local/lib/python3.10/site-packages/sklearn/base.py:458: UserWarning: X has feature names, but LogisticRegression was fitted without feature names
  warnings.warn(
In [10]:
# Train and evaluate XGBoost
clf_xgboost = xgb.XGBClassifier()
clf_xgboost.fit(X_train, y_train)

y_pred_xgboost = clf_xgboost.predict_proba(X_test)[:, 1]
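The roc_auc_score and roc_curve imports above are never used in the cells shown; presumably the models were also compared by AUC. A minimal sketch of that evaluation step, on toy data rather than the notebook's split:

```python
# Sketch: evaluate predicted probabilities with ROC AUC. Toy data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # scores, not labels
print(f"ROC AUC: {auc:.3f}")
```

AUC takes the positive-class probabilities (column 1 of predict_proba), which is exactly what y_pred_logistic_regression and y_pred_xgboost hold.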
In [14]:
from lime import lime_tabular
explainer = lime_tabular.LimeTabularExplainer(X_train.values, feature_names=X_train.columns, class_names=['0', '1'], discretize_continuous=True)
In [30]:
X_test_pos = X_test[y_test == 1]
y_test_pos = y_test[y_test == 1]

X_test_neg = X_test[y_test == 0]
y_test_neg = y_test[y_test == 0]

# randrange excludes the upper bound, unlike randint, so this cannot index out of range
positive_examples = [X_test_pos.values[random.randrange(len(X_test_pos))] for _ in range(5)]
negative_examples = [X_test_neg.values[random.randrange(len(X_test_neg))] for _ in range(5)]
In [31]:
# show positive examples and their explanations for logistic regression 

for positive_example in positive_examples:
    print("LogisticRegression")
    exp = explainer.explain_instance(positive_example, clf_logistic_regression.predict_proba, num_features=10)
    exp.show_in_notebook(show_table=True, show_all=False)
Predicted probabilities for the five sampled examples (the LIME explanation plots rendered by show_in_notebook are omitted from this export):

Logistic regression: [[0.86617326 0.13382674]]   XGBoost: [[0.92091334 0.07908667]]
Logistic regression: [[0.61436705 0.38563295]]   XGBoost: [[0.9818032  0.01819682]]
Logistic regression: [[0.58895066 0.41104934]]   XGBoost: [[0.16738784 0.83261216]]
Logistic regression: [[0.67285537 0.32714463]]   XGBoost: [[0.921117   0.07888298]]
Logistic regression: [[0.83065681 0.16934319]]   XGBoost: [[0.9972487  0.00275129]]